FIX: [PERF] Handle utf16 strings using simdutf and std::u16string by ffelixg · Pull Request #526 · microsoft/mssql-python

ffelixg · 2026-04-15T23:50:16Z

Work Item / Issue Reference

GitHub Issue: #514

Summary

Use external dependency simdutf to decode utf16 when decoding in c++
Use pybind11 support for std::u16string to decode utf16 when interacting with python
Remove SQLWCHARToWString, WideToUTF8, Utf8ToWString and WStringToSQLWCHAR; refactor call sites

ffelixg · 2026-04-15T23:56:35Z

I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit.

This brings arrow performance on linux for nvarchars in line with what it should be.

If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace WStringToSQLWCHAR

Copilot

Pull request overview

This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).

Changes:

Add simdutf (via find_package or FetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths.
Replace a number of std::wstring usages with std::u16string for connection strings, queries, and SQL_C_WCHAR parameter buffers.
Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
`mssql_python/pybind/CMakeLists.txt`	Adds `simdutf` dependency resolution and links it into the `ddbc_bindings` module.
`mssql_python/pybind/ddbc_bindings.h`	Introduces UTF-16 helpers (`dupeSqlWCharAsUtf16Le`, `utf16LeToUtf8Alloc`) and changes `ErrorInfo` to store UTF-8 `std::string`.
`mssql_python/pybind/ddbc_bindings.cpp`	Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + `simdutf`.
`mssql_python/pybind/connection/connection.h`	Updates connection APIs/state to store connection strings as `std::u16string`.
`mssql_python/pybind/connection/connection.cpp`	Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly.
`mssql_python/pybind/connection/connection_pool.h`	Updates pooling key types and APIs to use `std::u16string`.
`mssql_python/pybind/connection/connection_pool.cpp`	Implements pooling with `std::u16string` connection string keys.
`mssql_python/pybind/unix_utils.h`	Removes the `SQLWCHARToWString` declaration.
`mssql_python/pybind/unix_utils.cpp`	Removes the `SQLWCHARToWString` implementation.

Comments suppressed due to low confidence (1)

mssql_python/pybind/ddbc_bindings.cpp:583

In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses data() + SQL_NTS and sets bufferLength to size() * sizeof(SQLWCHAR). For SQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (prefer c_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.

Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.

                    std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
                        paramBuffers, param.cast<std::u16string>());
                    LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
                        "length=%zu characters, buffer=%zu bytes",
                        paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
                    dataPtr = sqlwcharBuffer->data();
                    bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
                    strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
                    *strLenOrIndPtr = SQL_NTS;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

gargsaumya · 2026-04-16T02:19:28Z

/azp run

azure-pipelines · 2026-04-16T02:19:38Z

Azure Pipelines successfully started running 1 pipeline(s).

github-actions · 2026-04-16T02:25:58Z

📊 Code Coverage Report

🔥 Diff Coverage 94%	🎯 Overall Coverage 26%	📈 Total Lines Covered: `7085` out of `26978` 📁 Project: `mssql-python`

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

mssql_python/pybind/connection/connection.cpp (100%)
mssql_python/pybind/ddbc_bindings.cpp (94.3%): Missing lines 1546,1621,1638,3753,3762,4095,4197
mssql_python/pybind/utf_utils.h (92.3%): Missing lines 15-16

Summary

Total: 154 lines
Missing: 9 lines
Coverage: 94%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1542-1550

  1542     LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
  1543     ErrorInfo errorInfo;
  1544     if (retcode == SQL_INVALID_HANDLE) {
  1545         LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1546         errorInfo.ddbcErrorMsg = "Invalid handle!";
  1547         return errorInfo;
  1548     }
  1549     assert(handle != 0);
  1550     SQLHANDLE rawHandle = handle->get();

Lines 1617-1625

  1617     return records;
  1618 }
  1619 
  1620 // Wrap SQLExecDirect
! 1621 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
  1622     LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
  1623         "query_length=%zu chars",
  1624         (void*)StatementHandle->get(), Query.length());
  1625     if (!SQLExecDirect_ptr) {

Lines 1634-1642

  1634         SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
  1635                            (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
  1636     }
  1637 
! 1638     SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
  1639     SQLRETURN ret;
  1640     {
  1641         // Release the GIL during the blocking ODBC call so that other Python
  1642         // threads (e.g. asyncio event loop, heartbeat threads) can run while

Lines 3749-3757

  3749                                      sizeof(DateTimeOffset) * fetchSize,
  3750                                      buffers.indicators[col - 1].data());
  3751                 break;
  3752             default:
! 3753                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3754                 std::ostringstream errorString;
  3755                 errorString << "Unsupported data type for column - " << columnName.c_str()
  3756                             << ", Type - " << dataType << ", column ID - " << col;
  3757                 LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 3758-3766

  3758                 ThrowStdException(errorString.str());
  3759                 break;
  3760         }
  3761         if (!SQL_SUCCEEDED(ret)) {
! 3762             std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3763             std::ostringstream errorString;
  3764             errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
  3765                         << dataType << ", column ID - " << col;
  3766             LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 4091-4099

  4091                     break;
  4092                 }
  4093                 default: {
  4094                     const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 4095                     std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  4096                     std::ostringstream errorString;
  4097                     errorString << "Unsupported data type for column - " << columnName.c_str()
  4098                                 << ", Type - " << dataType << ", column ID - " << col;
  4099                     LOG("FetchBatchData: %s", errorString.str().c_str());

Lines 4193-4201

  4193             case SQL_SS_TIMESTAMPOFFSET:
  4194                 rowSize += sizeof(DateTimeOffset);
  4195                 break;
  4196             default:
! 4197                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  4198                 std::ostringstream errorString;
  4199                 errorString << "Unsupported data type for column - " << columnName.c_str()
  4200                             << ", Type - " << dataType << ", column ID - " << col;
  4201                 LOG("calculateRowSize: %s", errorString.str().c_str());

mssql_python/pybind/utf_utils.h

  11 #include <string>
  12 
  13 inline std::string utf16LeToUtf8Alloc(const std::u16string& utf16) {
  14     if (utf16.empty()) {
! 15         return {};
! 16     }
  17 
  18     simdutf::result utf8Length =
  19         simdutf::utf8_length_from_utf16le_with_replacement(utf16.data(), utf16.size());
  20     std::string utf8(utf8Length.count, '\0');

📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)

mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.8%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.error.h: 2.1%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.9%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16.h: 18.2%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.build._deps.simdutf-src.src.simdutf.haswell.simd16-inl.h: 33.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 61.1%
mssql_python.pybind.build._deps.simdutf-src.src.generic.utf16.utf8_length_from_utf16_bytemask.h: 61.5%

🔗 Quick Links

⚙️ Build Summary	📋 Coverage Details
View Azure DevOps Build	Browse Full Coverage Report

ffelixg · 2026-04-16T23:44:55Z

Only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via url instead.

gargsaumya · 2026-04-17T03:58:12Z

/azp run

azure-pipelines · 2026-04-17T03:58:23Z

Azure Pipelines successfully started running 1 pipeline(s).

ffelixg · 2026-04-19T22:44:42Z

@gargsaumya The second CI error was due to the ubuntu image using an older version of cmake. Should be fine now I hope.

I have also replaced the remaining std::wstring occurences with std::u16string and eliminated helper functions as well as platform specific code accordingly. Let me know if you are happy with this holistic update to utf16 string handling or if you would prefer to keep it contained to the arrow fetch path.

gargsaumya · 2026-04-20T04:22:31Z

Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. ButI’ll still take a closer look at the changes and get back to you.
I’ve approved the workflow for now so we can run the pipeline and get a passing test run.

gargsaumya · 2026-04-20T04:22:46Z

/azp run

azure-pipelines · 2026-04-20T04:22:56Z

Azure Pipelines successfully started running 1 pipeline(s).

ffelixg · 2026-04-20T21:56:47Z

I think that this latest CI failure was due to a flaky test. The same test / platform passed in a past CI run and I don't think my changes affected that test. Could you rerun that one @gargsaumya?

gargsaumya · 2026-04-21T02:55:57Z

/azp run

azure-pipelines · 2026-04-21T02:56:09Z

Azure Pipelines successfully started running 1 pipeline(s).

gargsaumya

Really appreciate the effort here, this is a very meaningful contribution @ffelixg. I've reviewed the PR and left 2 inline comments, please take a look!

ffelixg · 2026-04-21T23:19:49Z

Glad you like it @gargsaumya! I totally agree with both your comments and pushed changes accordingly.

gargsaumya · 2026-04-22T05:26:14Z

/azp run

azure-pipelines · 2026-04-22T05:26:24Z

Azure Pipelines successfully started running 1 pipeline(s).

ffelixg · 2026-04-22T19:11:06Z

Seems like the same flaky test failed again

jahnvi480 · 2026-05-06T09:20:26Z

/azp run

azure-pipelines · 2026-05-06T09:20:37Z

Azure Pipelines successfully started running 1 pipeline(s).

sumitmsft · 2026-05-08T15:22:46Z

I have a few generic comments before I delve into core review:
@gargsaumya

Add a short note to CONTRIBUTING.md or a pybind/README.md: "Use std::u16string for all SQLWCHAR data; use simdutf for any UTF transcoding. Do not introduce std::wstring in the C++ bindings."
The three inline helpers (utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, reinterpretU16stringAsSqlWChar) are in ddbc_bindings.h. Would recommend to them to a dedicated utf_utils.h so they can be unit-tested in isolation.
We need to subscribe to simdutf releases. They have aggressive release timelines. We would want to evaluate before bumping its versions.

sumitmsft · 2026-05-08T15:39:20Z

-                result.push_back(static_cast<wchar_t>(UNICODE_REPLACEMENT_CHAR));
-            }
-        }
+inline std::string utf16LeToUtf8Alloc(const std::u16string& utf16) {


Would recommend to move utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, and reinterpretU16stringAsSqlWChar from ddbc_bindings.h into a small utf_utils.h so they can be unit tested and the changes don't force recompilation of the whole module.

I do not see dedicated unit tests for utf16LeToUtf8Alloc, dupeSqlWCharAsUtf16Le, or reinterpretU16stringAsSqlWChar. Edge cases (empty strings, surrogate pairs, lone surrogates) are only tested indirectly through integration tests. Once moved to utf_utils.h I would suggest to add targeted unit tests.

By targeted unit tests, am I correct in assuming that you mean pure c++ unit tests? Testing them via python wrappers and pytest feels wrong. It would also involve changes to the build process as well as CI pipelines, since there is currently no infrastructure for c++ tests. The predecessors of these functions were also only tested via integration or even fully inlined as was the case for some reinterpretU16stringAsSqlWChar applications.

Good point! I agree that setting up a full C++ unit test framework and CI pipeline changes is out of scope for this PR. Let's not block this PR on it, but I'd like to open a tracking issue for it in our internal backlog for future enhancements.

I see you've already added utf_utils.h, thank you! Let's keep moving with this PR. We are expecting one more review on this before we merge it.

@gargsaumya please track this: "but I'd like to open a tracking issue for it in our internal backlog for future enhancements" in ADO.

Co-authored-by: Copilot <copilot@github.com>

ffelixg · 2026-05-10T23:17:58Z

Thanks for the review @sumitmsft.

Regarding the first point, There's actually one more method that's currently being used to convert unicode aside from simdutf and std::u16string, which is the native python c api functions in ProcessWChar (fetchall/fetchmany). I changed your suggestion accordingly and put it into the pybind README. Not sure if you meant Saumya to do the documentation, since you tagged her, feel free to change it in any way @gargsaumya.

I hope the added dependency is not too much of an inconvenience. I looked into native iconv as well, but didn't find it appealing in comparison. The simdutf releases look to me like they are mostly about adding new features, which we likely won't care about. Since unicode conversion itself is a solved problem, I was hoping that we would technically be fine using the same version forever, assuming no crazy bugs.

gargsaumya · 2026-05-11T03:58:55Z

/azp run

azure-pipelines · 2026-05-11T03:59:06Z

Azure Pipelines successfully started running 1 pipeline(s).

PERF: Use simdutf for decoding utf16

3de212c

Copilot AI review requested due to automatic review settings April 15, 2026 23:50

Copilot started reviewing on behalf of ffelixg April 15, 2026 23:52 View session

Copilot AI reviewed Apr 15, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated

Include cstring

0e8661f

github-advanced-security AI found potential problems Apr 16, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.h Fixed

ffelixg added 2 commits April 17, 2026 01:22

Use url instead of git for fetching simdutf

19c46ab

Merge remote-tracking branch 'origin/main' into simdutf

4fbe283

ffelixg added 2 commits April 19, 2026 22:37

Account for older cmake versions

1b9df47

Replace remaining std::wstring uses with std::u16string

12bbb58

ffelixg changed the title ~~PERF: Use simdutf for decoding utf16~~ PERF: Handle utf16 strings using simdutf and std::u16string Apr 19, 2026

github-advanced-security AI found potential problems Apr 20, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.h Fixed

gargsaumya changed the title ~~PERF: Handle utf16 strings using simdutf and std::u16string~~ FIX: [PERF] Handle utf16 strings using simdutf and std::u16string Apr 20, 2026

Merge branch 'main' into simdutf

4afb7c5

gargsaumya reviewed Apr 21, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated

Comment thread mssql_python/pybind/CMakeLists.txt

Set URL_HASH & Avoid utf16LeToUtf8Alloc param copy

c71a0a3

Merge branch 'main' into simdutf

9a30c1a

gargsaumya previously approved these changes Apr 23, 2026

View reviewed changes

gargsaumya requested review from bewithgaurav and sumitmsft April 23, 2026 11:33

gargsaumya and others added 2 commits April 28, 2026 09:22

Merge branch 'main' into simdutf

f42106a

Merge remote-tracking branch 'origin/main' into simdutf

eb53cc1

ffelixg dismissed gargsaumya’s stale review via eb53cc1 April 29, 2026 18:28

Merge branch 'main' into simdutf

d66aa21

sumitmsft reviewed May 8, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated

sumitmsft reviewed May 8, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated

sumitmsft reviewed May 8, 2026

View reviewed changes

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated

sumitmsft reviewed May 8, 2026

View reviewed changes

ffelixg and others added 4 commits May 8, 2026 19:47

Fix assert message & remove windows specific comment

a4b7ac4

Move new helpers to utf_utils.h

fe341af

Merge remote-tracking branch 'origin/main' into simdutf

555e1b5

Add string handling note

007b7ac

Co-authored-by: Copilot <copilot@github.com>

github-advanced-security AI found potential problems May 11, 2026

View reviewed changes

Comment thread mssql_python/pybind/utf_utils.h Dismissed

Conversation

ffelixg commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Work Item / Issue Reference

Summary

Uh oh!

ffelixg commented Apr 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

gargsaumya commented Apr 16, 2026

Uh oh!

azure-pipelines Bot commented Apr 16, 2026

Uh oh!

Uh oh!

github-actions Bot commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

📊 Code Coverage Report

🔥 Diff Coverage

94%

🎯 Overall Coverage

26%

Diff Coverage

Diff: main...HEAD, staged and unstaged changes

Summary

mssql_python/pybind/ddbc_bindings.cpp

mssql_python/pybind/utf_utils.h

📋 Files Needing Attention

🔗 Quick Links

Uh oh!

ffelixg commented Apr 16, 2026

Uh oh!

gargsaumya commented Apr 17, 2026

Uh oh!

azure-pipelines Bot commented Apr 17, 2026

Uh oh!

ffelixg commented Apr 19, 2026

Uh oh!

gargsaumya commented Apr 20, 2026

Uh oh!

gargsaumya commented Apr 20, 2026

Uh oh!

azure-pipelines Bot commented Apr 20, 2026

Uh oh!

Uh oh!

ffelixg commented Apr 20, 2026

Uh oh!

gargsaumya commented Apr 21, 2026

Uh oh!

azure-pipelines Bot commented Apr 21, 2026

Uh oh!

gargsaumya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ffelixg commented Apr 21, 2026

Uh oh!

gargsaumya commented Apr 22, 2026

Uh oh!

azure-pipelines Bot commented Apr 22, 2026

Uh oh!

ffelixg commented Apr 22, 2026

Uh oh!

jahnvi480 commented May 6, 2026

Uh oh!

azure-pipelines Bot commented May 6, 2026

Uh oh!

sumitmsft commented May 8, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sumitmsft May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

ffelixg commented Apr 15, 2026 •

edited

Loading

github-actions Bot commented Apr 16, 2026 •

edited

Loading

sumitmsft May 8, 2026 •

edited

Loading